From R. A. Fisher, 1938: To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of
https://github.com/rbind/simplystats
link to free online book “r4ds”: “R for Data Science”, Hadley Wickham & Garret Grolemund
https://www.gapminder.org/tools/#$state$time$value=2018;;&chart-type=bubbles
I find the the ggplot2 package for R the best plotting system for R. It’s syntax is an implementation of the ‘grammar of graphics’ and all plots in this demo are created by using ggplot2.
To cite ggplot2 in your work run citation("ggplot2") in R to get the reference details.
The 3rd edition of the free online book resource for ggplot2: “ggplot2 Elegant Graphics for Data Analysis”
{dslabs} and other packages useddata(package = "dslabs")
data("us_contagious_diseases", package = "dslabs")
unique(us_contagious_diseases$disease)## [1] Hepatitis A Measles Mumps Pertussis Polio Rubella
## [7] Smallpox
## Levels: Hepatitis A Measles Mumps Pertussis Polio Rubella Smallpox
Every data cycle starts with a question / problem:
library(dslabs)
data(us_contagious_diseases)
the_disease <- "Measles"
dat <- us_contagious_diseases %>%
dplyr::filter(!state%in%c("Hawaii","Alaska") &
disease == the_disease) %>%
mutate(rate = count / population * 10000 * 52 / weeks_reporting)
jet.colors <- colorRampPalette(c("#F0FFFF", "cyan", "#007FFF", "yellow", "#FFBF00", "orange", "red", "#7F0000"), bias = 2.25)
dat %>% mutate(state = reorder(state, desc(state))) %>%
ggplot(aes(year, state, fill = rate)) +
geom_tile(color = "white", size = 0.35) +
scale_x_continuous(expand = c(0,0)) +
scale_fill_gradientn(colors = jet.colors(16), na.value = 'white') +
geom_vline(xintercept = 1963, col = "black") +
theme_minimal() +
theme(panel.grid = element_blank()) +
coord_cartesian(clip = 'off') +
ggtitle(the_disease) +
ylab("") +
xlab("") +
theme(legend.position = "bottom", text = element_text(size = 8)) +
annotate(geom = "text", x = 1963, y = 50.5, label = "Vaccine introduced", size = 3, hjust = 0)Try changing the line in the code chunk above from
the_disease <- "Measles" (Mazelen in Dutch) to
the_disease <- "Smallpox" (Pokken in Dutch) and rerun the code.
Can you find out which part of the code above plots the vertical line and prints the words “Vaccine introduced”. Try to adapt the figure for the Smallpox data: The introduction of the smallpox vaccine was already as early as in 1940. TIP You will need to change the year 1963 twice to 1940 in oder for the new figure to be correct for the Smallpox subset of the data. Try adapting the code and rerun the code chunk.
Question: When did vaccination against smallpox stop? Try googling for the answer!
The gapminder dataset contains a number of measurements on health and income outcomes for 184 countries from 1960 to 2016. It also includes two character vectors, OECD and OPEC, with the names of OECD and OPEC countries from 2016.
## # A tibble: 2 x 9
## country year infant_mortality life_expectancy fertility population
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Albania 1960 115. 62.9 6.19 1636054
## 2 Algeria 1960 148. 47.5 7.65 11124892
## # ... with 3 more variables: gdp <dbl>, continent <fct>, region <fct>
## [1] "country" "year" "infant_mortality"
## [4] "life_expectancy" "fertility" "population"
## [7] "gdp" "continent" "region"
We call this ‘overplotting’.
This can be fixed in several ways:
"jitter") to pointsBasically you map an aesthetic (aes()) or a characteristic to a variable
Let’s go over these overplotting methods one by one
alpha) of points or lines in the dataalphagapminder %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(aes(colour = continent), alpha = 0.1) +
guides(colour = guide_legend(override.aes = list(alpha = 1)))ggplot2 call. For example, adjust the alpha = ... or change the variable in x = ..., y = ... or colour = ...names(gapminder) gives you the variable names that you can changeguides(colour = guide_legend(override.aes = list(alpha = 1)))
#reduce_data_plot <- gapminder %>%
filter(continent == "Africa" | continent == "Europe") %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(aes(colour = continent), alpha = 0.2) +
## override the alpha setting for the points in the legend:
guides(colour = guide_legend(override.aes = list(alpha = 1)))
reduce_data_plotaes() part of the geom_point() do?reduce_data_plot <- gapminder %>%
filter(continent == "Africa" | continent == "Europe") %>%
ggplot(aes(x = fertility,
y = life_expectancy, colour = continent)) +
geom_point(alpha = 0.2) +
## override the alpha setting for the points in the legend:
guides(colour = guide_legend(override.aes = list(alpha = 1))) ## or e.g. show only two years and map a shape to continent
shape_plot <- gapminder %>%
dplyr::filter(continent == "Africa" | continent == "Europe",
year == "1960" | year == "2010") %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(aes(colour = as_factor(as.character(year)),
shape = continent),
alpha = 0.7)
shape_plotas_factor(as.character(year)) call and replace this by only year above and rerun the plot, what happened?Create panels for ‘categorical’ or so-called ‘factor’ variables in R
facets_plot <- gapminder %>%
dplyr::filter(continent == "Africa" | continent == "Europe",
year == "1960" | year == "2010") %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(aes(colour = continent), alpha = 0.5) +
facet_wrap(~ year)
facets_plotlibrary(ggrepel)
years <- c("1960", "1970", "1980", "1990", "2000", "2010")
summarize_plot <- gapminder %>%
dplyr::filter(year %in% years) %>%
group_by(continent, year) %>%
summarise(mean_life_expectancy = mean(life_expectancy),
mean_fertility = mean(fertility)) %>%
ggplot(aes(x = mean_fertility,
y = mean_life_expectancy)) +
geom_point(aes(colour = continent), alpha = 0.7)
summarize_plot{ggrepel}library(ggrepel)
years <- c("1960", "1970", "1980", "1990", "2000", "2010")
labels_plot <- gapminder %>%
dplyr::filter(year %in% years) %>%
group_by(continent, year) %>%
summarise(mean_life_expectancy = mean(life_expectancy),
mean_fertility = mean(fertility)) %>%
ggplot(aes(x = mean_fertility,
y = mean_life_expectancy)) +
geom_point(aes(colour = continent), alpha = 0.7) +
geom_label_repel(aes(label=year), size = 2.5, box.padding = .5)
labels_plot## Model
lm <- gapminder %>% lm(formula = life_expectancy ~ fertility)
correlation <- cor.test(x = gapminder$fertility, y = gapminder$life_expectancy, method = "pearson")
# save predictions of the model in the new data frame
# together with variable you want to plot against
predicted_df <- data.frame(gapminder_pred = predict(lm, gapminder),
fertility = gapminder$fertility)model_plot <- gapminder %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
# geom_point(alpha = 0.03) +
geom_line(data = predicted_df, aes(x = fertility,
y = gapminder_pred),
colour = "darkred", size = 1)
model_plot{ggpubr} packagegeom_smooth to display potential relationshipsgapminder %>%
ggplot(aes(x = fertility,
y = life_expectancy)) +
geom_point(alpha = 0.02) +
geom_smooth(method = "lm") +
stat_cor(method = "pearson", label.x = 2, label.y = 30) +
theme_bw()Which tricks can we use to reduce the dimensionality of the plotted data (prevent overpltting)?
Try listing at least 6 methods:
gdp, Gross Domestic Product and infant_mortality rate.https://en.wikipedia.org/wiki/Gross_domestic_product Wikipedia: Gross Domestic Product (GDP) is a monetary measure of the market value of all the final goods and services produced in a period of time, often annually or quarterly. Nominal GDP estimates are commonly used to determine the economic performance of a whole country or region, and to make international comparisons.
gdp_infant_plot <- gapminder %>%
dplyr::filter(continent == "Europe" | continent == "Africa") %>%
ggplot(aes(x = gdp,
y = infant_mortality)) +
geom_point()
gdp_infant_plotThe figure above does not provide any clue on a possible difference between Europe and Africa, nor does it convey any information on trends over time.
colour_to_continent <- gapminder %>%
dplyr::filter(continent == "Europe" | continent == "Africa") %>%
ggplot(aes(x = gdp,
y = infant_mortality)) +
geom_point(aes(colour = continent))
colour_to_continentLet’s investigate whether things have improved over time. We compare 1960 to 2010 by using a panel of two figures. Adding simply facet_wrap( ~ facetting_variable) will do the trick.
Without looking ahead try to contruct a plot that conveys information on the gdp per continent, over time. Try to recycle some of the examples above.
facets_gdp_infant <- gapminder %>%
dplyr::filter(continent == "Europe" | continent == "Africa",
year == "1960" | year == "2010") %>%
ggplot(aes(x = gdp,
y = infant_mortality)) +
geom_point(aes(colour = continent)) +
facet_wrap(~ year) +
theme(axis.text.x = element_text(angle = -90, hjust = 1))
facets_gdp_infantSo far we have been mapping colours and shapes to categorical variables. You can also map to continuous variables though.
continuous <- gapminder %>%
dplyr::filter(country == "Netherlands" |
country == "China" |
country == "India") %>%
dplyr::filter(year %in% years) %>%
ggplot(aes(x = year,
y = life_expectancy)) +
geom_point(aes(size = population, colour = country)) +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
geom_line(aes(group = country)) +
theme_bw()
continuousTry plotting the infant_mortality against the filtered years for the same countries as the code above (Netherlands, India, China), recycling some of the code above. Discuss the resulting graph in the light of the life_expectancy graph, what do you think about the the developments in China?
Want to know more? see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331212/ Babxiarz, 2016
Analyze the following code chunk: try running line by line to see what happens:
group_by() statement, what happens if you do?population_plot <- gapminder %>%
dplyr::group_by(continent, year) %>%
dplyr::filter(year %in% years) %>%
summarise(sum_population = sum(population)) %>%
ggplot(aes(x = year,
y = sum_population)) +
geom_point(aes(colour = continent)) +
geom_line(aes(group = continent,
colour = continent))
population_plotranking_plot <- gapminder %>%
dplyr::filter(continent == "Europe",
year == 2010) %>%
ggplot(aes(x = reorder(as_factor(country), population),
y = log10(population))) +
geom_point() +
ylab("log10(Population)") +
xlab("Country") +
coord_flip() +
geom_point(data = filter(gapminder %>%
dplyr::filter(continent == "Europe",
year == 2010), population >= 1e7), colour = "red")
ranking_plotWe filter for “Americas” and “Oceania” and look at life_expectancy over the years.
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
gapminder %>%
dplyr::filter(continent == "Americas" | continent == "Oceania") %>%
ggplot(aes(x = year,
y = life_expectancy)) +
geom_line(aes(group = continent,
colour = continent))Obviously something went wrong here. Please, discuss with your neighbour what you think happened or needs to be done to fix this (without looking ahead ;-) )
We can see what happened if we plot individual datapoints
gapminder %>%
dplyr::filter(continent == "Americas" | continent == "Oceania") %>%
ggplot(aes(x = year,
y = life_expectancy)) +
geom_point(aes(colour = country)) +
theme(legend.position="none") +
facet_wrap( ~ continent) +
theme(legend.position="none") ## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
gapminder %>%
dplyr::filter(continent == "Americas" | continent == "Oceania") %>%
group_by(continent, year) %>%
summarise(mean_life_expectancy = mean(life_expectancy)) %>%
ggplot(aes(x = year,
y = mean_life_expectancy)) +
geom_line(aes(group = continent,
colour = continent)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))df <- gapminder %>%
dplyr::filter(continent == "Americas" | continent == "Oceania") %>%
group_by(continent, year)
model <- aov(data = df, life_expectancy ~ continent * year)
anova(model)## Analysis of Variance Table
##
## Response: life_expectancy
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 1 8982 8982 269.104 <2e-16 ***
## year 1 58606 58606 1755.931 <2e-16 ***
## continent:year 1 9 9 0.278 0.5981
## Residuals 2732 91183 33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Some remarks on the above Two-way ANOVA:
Sometimes you have overlapping plots and adding transparency with alpha() or mapping colour to underlying categorical values is not working because there are simple to many points overlapping
Let’s look at an example
gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
group_by(continent) %>%
dplyr::filter(year %in% years) %>%
ggplot(aes(x = year,
y = infant_mortality)) +
geom_point(aes(colour = country)) +
theme(legend.position="none")In such cases it can be helpfull to add some noise to the points (position = "jitter") to reduce overlapping. This can be a powerfull approach, especially when combined with setting alpha()
gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
dplyr::filter(year %in% years) %>%
group_by(continent) %>%
ggplot(aes(x = year,
y = infant_mortality)) +
geom_point(aes(colour = continent), position = "jitter") It would be nice to know what the mean child mortality is for both continents
gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
dplyr::filter(year %in% years) %>%
group_by(continent, year) %>%
summarise(mean_infant_mortality = mean(infant_mortality, na.rm = TRUE)) %>%
ggplot(aes(x = year,
y = mean_infant_mortality)) +
geom_col(aes(fill = continent), position = "dodge") Now that we have the mean infant mortality for each year for the two continents, let’s add that data to the previous dot plot where we used jitter
mean_inf_mort <- gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
dplyr::filter(year %in% years) %>%
group_by(continent, year) %>%
summarise(mean_infant_mortality = mean(infant_mortality, na.rm = TRUE))
gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
dplyr::filter(year %in% years) %>%
group_by(continent) %>%
ggplot(aes(x = year,
y = infant_mortality)) +
geom_point(aes(colour = continent), position = "jitter") +
## summary data added to previous
geom_line(data = mean_inf_mort, aes(x = year,
y = mean_infant_mortality,
colour = continent), size = 2)In the figure above we can observe a number of countries in ‘Americas’ continent that have a child mortality that are above the average (over the years) of ‘Africa’. Which countries are this?
library(ggiraph)
gapminder$country <-
str_replace_all(string = gapminder$country,
pattern = "'",
replacement = "_")
interactive_inf_mort <- gapminder %>%
dplyr::filter(continent == "Americas" |
continent == "Africa") %>%
dplyr::filter(year %in% years) %>%
group_by(region, country) %>%
ggplot(aes(x = year,
y = infant_mortality)) +
geom_point_interactive(aes(tooltip = country, colour = region), position = "jitter") +
# geom_point(aes(colour = continent), position = "jitter") +
## summary data added to previous
geom_line(data = mean_inf_mort, aes(x = year,
y = mean_infant_mortality,
colour = continent, group = continent), size = 2
)
interactive_inf_mort## [1] "Albania" "Algeria"
## [3] "Angola" "Antigua and Barbuda"
## [5] "Argentina" "Armenia"
## [7] "Aruba" "Australia"
## [9] "Austria" "Azerbaijan"
## [11] "Bahamas" "Bahrain"
## [13] "Bangladesh" "Barbados"
## [15] "Belarus" "Belgium"
## [17] "Belize" "Benin"
## [19] "Bhutan" "Bolivia"
## [21] "Bosnia and Herzegovina" "Botswana"
## [23] "Brazil" "Brunei"
## [25] "Bulgaria" "Burkina Faso"
## [27] "Burundi" "Cambodia"
## [29] "Cameroon" "Canada"
## [31] "Cape Verde" "Central African Republic"
## [33] "Chad" "Chile"
## [35] "China" "Colombia"
## [37] "Comoros" "Congo, Dem. Rep."
## [39] "Congo, Rep." "Costa Rica"
## [41] "Cote d_Ivoire" "Croatia"
## [43] "Cuba" "Cyprus"
## [45] "Czech Republic" "Denmark"
## [47] "Djibouti" "Dominican Republic"
## [49] "Ecuador" "Egypt"
## [51] "El Salvador" "Equatorial Guinea"
## [53] "Eritrea" "Estonia"
## [55] "Ethiopia" "Fiji"
## [57] "Finland" "France"
## [59] "French Polynesia" "Gabon"
## [61] "Gambia" "Georgia"
## [63] "Germany" "Ghana"
## [65] "Greece" "Greenland"
## [67] "Grenada" "Guatemala"
## [69] "Guinea" "Guinea-Bissau"
## [71] "Guyana" "Haiti"
## [73] "Honduras" "Hong Kong, China"
## [75] "Hungary" "Iceland"
## [77] "India" "Indonesia"
## [79] "Iran" "Iraq"
## [81] "Ireland" "Israel"
## [83] "Italy" "Jamaica"
## [85] "Japan" "Jordan"
## [87] "Kazakhstan" "Kenya"
## [89] "Kiribati" "South Korea"
## [91] "Kuwait" "Kyrgyz Republic"
## [93] "Lao" "Latvia"
## [95] "Lebanon" "Lesotho"
## [97] "Liberia" "Libya"
## [99] "Lithuania" "Luxembourg"
## [101] "Macao, China" "Macedonia, FYR"
## [103] "Madagascar" "Malawi"
## [105] "Malaysia" "Maldives"
## [107] "Mali" "Malta"
## [109] "Mauritania" "Mauritius"
## [111] "Mexico" "Micronesia, Fed. Sts."
## [113] "Moldova" "Mongolia"
## [115] "Montenegro" "Morocco"
## [117] "Mozambique" "Namibia"
## [119] "Nepal" "Netherlands"
## [121] "New Caledonia" "New Zealand"
## [123] "Nicaragua" "Niger"
## [125] "Nigeria" "Norway"
## [127] "Oman" "Pakistan"
## [129] "Panama" "Papua New Guinea"
## [131] "Paraguay" "Peru"
## [133] "Philippines" "Poland"
## [135] "Portugal" "Puerto Rico"
## [137] "Qatar" "Romania"
## [139] "Russia" "Rwanda"
## [141] "St. Lucia" "St. Vincent and the Grenadines"
## [143] "Samoa" "Saudi Arabia"
## [145] "Senegal" "Serbia"
## [147] "Seychelles" "Sierra Leone"
## [149] "Singapore" "Slovak Republic"
## [151] "Slovenia" "Solomon Islands"
## [153] "South Africa" "Spain"
## [155] "Sri Lanka" "Sudan"
## [157] "Suriname" "Swaziland"
## [159] "Sweden" "Switzerland"
## [161] "Syria" "Tajikistan"
## [163] "Tanzania" "Thailand"
## [165] "Timor-Leste" "Togo"
## [167] "Tonga" "Trinidad and Tobago"
## [169] "Tunisia" "Turkey"
## [171] "Turkmenistan" "Uganda"
## [173] "Ukraine" "United Arab Emirates"
## [175] "United Kingdom" "United States"
## [177] "Uruguay" "Uzbekistan"
## [179] "Vanuatu" "Venezuela"
## [181] "West Bank and Gaza" "Vietnam"
## [183] "Yemen" "Zambia"
## [185] "Zimbabwe"
west <- c("Western Europe","Northern Europe","Southern Europe",
"Northern America","Australia and New Zealand")
gapminder <- gapminder %>%
mutate(group = case_when(
region %in% west ~ "The West",
region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
TRUE ~ "Others"))
gapminder <- gapminder %>%
mutate(group = factor(group, levels = rev(c("Others", "Latin America", "East Asia","Sub-Saharan Africa", "The West"))))
filter(gapminder, year%in%c(1962, 2013) & !is.na(group) &
!is.na(fertility) & !is.na(life_expectancy)) %>%
mutate(population_in_millions = population/10^6) %>%
ggplot( aes(fertility, y=life_expectancy, col = group, size = population_in_millions)) +
geom_point(alpha = 0.8) +
guides(size=FALSE) +
theme(plot.title = element_blank(), legend.title = element_blank()) +
coord_cartesian(ylim = c(30, 85)) +
xlab("Fertility rate (births per woman)") +
ylab("Life Expectancy") +
geom_text(aes(x=7, y=82, label=year), cex=12, color="grey") +
facet_grid(. ~ year) +
theme(strip.background = element_blank(),
strip.text.x = element_blank(),
strip.text.y = element_blank(),
legend.position = "top")For this part we use a different and more simple dataset This dataset contains 1192 observations on self-reported:
height (inch)earn ($)sex (gender)ed (currently unannotated)age (years)race## # A tibble: 1,192 x 6
## earn height sex ed age race
## <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 50000 74.4 male 16 45 white
## 2 60000 65.5 female 16 58 white
## 3 30000 63.6 female 16 29 white
## 4 50000 63.1 female 16 91 other
## 5 51000 63.4 female 17 39 white
## 6 9000 64.4 female 15 26 white
## 7 29000 61.7 female 12 49 white
## 8 32000 72.7 male 17 46 white
## 9 2000 72.0 male 15 21 hispanic
## 10 27000 72.2 male 12 26 white
## # ... with 1,182 more rows
We will focus on the variable height here
summary_heights_data <- heights_data %>%
group_by(sex, age) %>%
summarise(mean_height = mean(height, na.rm = TRUE),
min_height = min(height),
max_height = max(height)) %>%
arrange(desc(mean_height))
summary_heights_data[c(1:4),]## # A tibble: 4 x 5
## # Groups: sex [2]
## sex age mean_height min_height max_height
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 female 55 141. 61.9 664.
## 2 male 39 134. 66.6 572.
## 3 male 55 73.2 71.7 74.8
## 4 male 91 73.1 73.1 73.1
From the above summary we can conclude that there are two outliers (presumably entry errors).
Calculate the height in meters for each outlier in the Console 1 inch = 0,0254 meters
Please discuss the solution with your neighbour
This distribution looks odd. When you see a large x-axis with no data plotted on it, it usually means there is an outlier. If you look carefully, you will spot two outliers around 600
So apparantly there is one data point that is way off from the rest of the distribution. Let’s remove this point, using filter() from the {dplyr} package like we did before on the gapminder dataset.
## by sex
heights_data %>%
dplyr::filter(height < 100) %>%
ggplot(aes(y = height, x = sex)) +
geom_boxplot()Now let’s plot a new distribution plot, this time we plot density, leaving the outlier out
heights_data %>%
dplyr::filter(height < 100) %>%
ggplot(aes(height)) +
geom_freqpoly(aes(y = ..density..))## by sex
heights_data %>%
dplyr::filter(height < 100) %>%
ggplot(aes(height)) +
geom_freqpoly(aes(y = ..density.., colour = sex))The best way to spot and take care of an outlier (you never delete an outlier unsubstantiated!!) is to create multiple visualizations as decribed above. However, there is a formal, simple statistical apporach to assess whether an observation is an outlier. The apporach below can only be used to detect a single outlier in a numeric vector. For more complex (multivariate) situations see e.g.: http://r-statistics.co/Outlier-Treatment-With-R.html
The {outliers} package contains a number of tools. Here we will demonstrate the outlier() and dixon.test() functions. For more information on the DQT, see: https://www.statisticshowto.datasciencecentral.com/dixons-q-test/
outlier() functionThis function finds the value with largest difference between it and sample mean, which can be an outlier.
We will walk trough a simple exercise that first shows the result of the outlier() function and than we calculate it by hand
First we create dummy data for a triplicate series of measurements. We assume 5 measurement values, replicated three times, containing one outlier. We also assume is was a concentration depended series in which the index indicates the used concentration.
set.seed(123)
## create data
random_numbers <- rnorm(n = 15, mean = 2.5, sd = 0.9)
## inject outlier
random_numbers[4] <- 9.54
## define triplicates
triplicates <- rep(1:3, times = 5)
index <- rep(1:5, times = 3)
## put in dataframe
measured_data <- tibble(measured = random_numbers, replicate = triplicates,
index = index)
measured_data## # A tibble: 15 x 3
## measured replicate index
## <dbl> <int> <int>
## 1 2.00 1 1
## 2 2.29 2 2
## 3 3.90 3 3
## 4 9.54 1 4
## 5 2.62 2 5
## 6 4.04 3 1
## 7 2.91 1 2
## 8 1.36 2 3
## 9 1.88 3 4
## 10 2.10 1 5
## 11 3.60 2 1
## 12 2.82 3 2
## 13 2.86 1 3
## 14 2.60 2 4
## 15 2.00 3 5
## # A tibble: 15 x 3
## measured replicate index
## <dbl> <int> <int>
## 1 9.54 1 4
## 2 4.04 3 1
## 3 3.90 3 3
## 4 3.60 2 1
## 5 2.91 1 2
## 6 2.86 1 3
## 7 2.82 3 2
## 8 2.62 2 5
## 9 2.60 2 4
## 10 2.29 2 2
## 11 2.10 1 5
## 12 2.00 3 5
## 13 2.00 1 1
## 14 1.88 3 4
## 15 1.36 2 3
## plot
measured_data %>%
# enframe() %>%
ggplot(aes(x = index, y = measured)) +
geom_point() +
geom_point(data = dplyr::filter(.data = measured_data, measured > 5),
colour = "red", size = 6, shape = 4)outliers::outlier()## [1] 9.54
outliers::dixon.test()## note that these functions from the {outliers} package take a vector (use $)
outliers::dixon.test(measured_data$measured)##
## Dixon test for outliers
##
## data: measured_data$measured
## Q = 0.7472, p-value < 2.2e-16
## alternative hypothesis: highest value 9.54 is an outlier
The test is significant against the NULL hypothesis that the highest value is not an outlier. We can conclude that this NULL hypothesis does not hold on the basis of the significant p-value < 0.05
outlier()
mean_vctr <- mean(measured_data$measured)
distance <- measured_data$measured - mean_vctr
ind <- distance == max(distance)
measured_data[ind,]## # A tibble: 1 x 3
## measured replicate index
## <dbl> <int> <int>
## 1 9.54 1 4
dixon.test()
The Dixon’s Q Test can be calculated manually following the following procedure:
How to Run Dixon’s Q Test (R10).
Note: make sure your data set is normally distributed before running the test; for example, run a Shapiro-Wilk test. Running it on different distributions will lead to erroneous results. An extreme value may throw off any test for normality, so try running that test without the suspect data item. If your data set still doesn’t meet the assumption of normality after running a test for it, then you should not run Dixon’s Q Test.
Note that for very small sample sizes the assumtion for Normal distribution usually does not hold. We will see more formal way of establishing Normal distribution below.
Caution: the test should not be used more than once for the same set of data.
To perform the calculations:
Step 1: Sort your data into ascending order (smallest to largest).
Step 2 :Find the Q statistic using the following formula:
\(Qexp = \frac{x2 - x1}{xn - x1}\)
Where
Dixon’s Q Test: Definition, Step by Step Examples + Q Critical Values Tables
Find Outliers > Dixon’s Q Test What is Dixon’s Q Test?
Dixon’s Q test, or just the “Q Test” is a way to find outliers in very small, normally distributed, data sets. Small data sets are usually defined as somewhere between 3 and 7 items. It’s commonly used in chemistry, where data sets sometimes include one suspect observation that’s much lower or much higher than the other values. Keeping an outlier in data affects calculations like the mean and standard deviation, so true outliers should be removed.
Dixon came up with many different equations to find true outliers. The most commonly used one is called the R10 or simply the “Q” version, which is used to test if one single value is an outlier in a sample size of between 3 and 7. Dean and Dixon did suggest various other formulas in a later paper, but these are not commonly used. For a full list of alternate formulas for different sample sizes (up to about 30), go to: Dixon’s Test, Alternate Formulas and Tables. How to Run Dixon’s Q Test (R10).
Note: make sure your data set is normally distributed before running the test; for example, run a Shapiro-Wilk test. Running it on different distributions will lead to erroneous results. An extreme value may throw off any test for normality, so try running that test without the suspect data item. If your data set still doesn’t meet the assumption of normality after running a test for it, then you should not run Dixon’s Q Test,
Caution: the test should not be used more than once for the same set of data.
Step 1: Sort your data into ascending order (smallest to largest). 167, 177, 180, 181, 185, 188, 189.
Step 2 :Find the Q statistic using the following formula: dixon’s q test statistic
Where:
x1 is the smallest (suspect) value,
x2 is the second smallest value,
and xn is the largest value.
Step 3: Find the Q critical value in the Q table (see below). For a sample size of 3 and an alpha level of 5%, the critical value is 0.970.
Step 4: Compare the Q statistic from Step 2 with the Q critical value in Step 3. If the Q statistic is greater than the Q critical value, the point is an outlier.
Let’s try it on our measured_data
## [1] 1.361445 1.881832 1.995572 1.999743 2.098904 2.292840 2.599614
## [8] 2.616359 2.823832 2.860694 2.914825 3.601674 3.902837 4.043558
## [15] 9.540000
x1 <- measured_data_ranked$measured[1]
x2 <- measured_data_ranked$measured[2]
xn <- max(measured_data_ranked$measured)
Qexp = (x2-x1) / (xn-x1)
##
0.5/8## [1] 0.0625
a qqplot provides a visual aid to assess whether a distribution is approaching normality
##
source(file = here::here("code", "ggqq.R"))
height_data_outlier_removed <- heights_data %>%
dplyr::filter(height < 100)
gg_qq(height_data_outlier_removed$height) ## 25% 75%
## 66.926998 4.328462
##
## Shapiro-Wilk normality test
##
## data: height_data_outlier_removed$height
## W = 0.98485, p-value = 8.491e-10
all data -> reject hypothesis that the sample has a normal distribution
males <- height_data_outlier_removed %>%
dplyr::filter(sex == "male")
females <- height_data_outlier_removed %>%
dplyr::filter(sex == "female")
shapiro.test(males$height)##
## Shapiro-Wilk normality test
##
## data: males$height
## W = 0.99053, p-value = 0.002532
##
## Shapiro-Wilk normality test
##
## data: females$height
## W = 0.99277, p-value = 0.002105
##
## Shapiro-Wilk normality test
##
## data: males$age
## W = 0.93358, p-value = 3.506e-14
##
## Shapiro-Wilk normality test
##
## data: females$age
## W = 0.93978, p-value = 4.862e-16